Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit HTTP request method cardinality: original_method (option A) #17

Merged
merged 12 commits into from
Jun 20, 2023

Conversation

lmolkova
Copy link
Contributor

Moving in open-telemetry/opentelemetry-specification#3478.

Changes

  • restricts http.request.method values to well-known names
  • requires to use other for unknown methods to prevent cardinality explosion if a malicious (or buggy) client sends custom and dynamic methods.
  • allows recording actual method value in the http.request.original_method

There was a discussion on the original PR and at HTTP SIG meeting, the summary and options are in the comments.

@lmolkova lmolkova requested review from a team May 12, 2023 18:15
@lmolkova
Copy link
Contributor Author

lmolkova commented May 12, 2023

Option A (currently implemented in this PR)

http.request.method represents the canonical method, values are restricted on all signals.

Example:

  • Invalid method:

    • http.request.method = other
    • http.request.original_method = XYZZ (only if different from method)
  • Valid method:

    • http.request.method = GET

Cons:

  • breaking for existing ECS users who populate it on logs.
  • it's necessary to check two fields to find the method: http.request.original_method first and if missing, check http.request.method

Possible solution: Different requirement levels for different signals

  • Logs: http.request.original_method could be required
  • Traces http.request.original_method could be conditionally required only if different from http.request.method

I.e. example of a valid request would be

  • http.request.method = GET
  • http.request.original_method is empty on traces and is GET on logs

@lmolkova
Copy link
Contributor Author

lmolkova commented May 12, 2023

Option B: http.request.method represents original method, http.request.canonical_method captures canonical one

PR: #34

Invalid method:

  • http.request.method = XYZZ
  • http.request.canonical_method = other

Valid method:

  • http.request.method = GET
  • http.request.canonical_method = GET

Cons: duplication (otherwise http.request.method would capture invalid methods only).

@lmolkova
Copy link
Contributor Author

lmolkova commented May 12, 2023

Option C: Single attribute http.request.method, but special rules for span names and metrics

PR: #35

Invalid method:

  • Traces:
    • http.request.method = XYZZZ
    • span name: HTTP other
  • Metrics: http.request.method = other
  • Logs: http.request.method = XYZZZ

Valid method:

  • Traces:
    • http.request.method = GET
    • span name: HTTP GET
  • Metrics: http.request.method = GET
  • Logs: http.request.method = GET

Cons:

  • Slightly different attribute semantics for metrics
  • Harder to do traces/logs-to-metrics aggregation

@ruflin
Copy link

ruflin commented May 15, 2023

Different requirement levels for different signals

I picked this from A but was thinking of option C when I first read this. I agree, different signals have potentially different requirements but still the same field should be used. For a trace, only valid fields should be accepted / are part of the standard. For logs the same fields is more lenient but it is still encouraged, to be strict.

With the above, we could get the best of both worlds. There is a way to express in the schema requirements for different signals but still optimise for bringing all the different signals together. It is then up to the storage engine / query engine of the different implementations to see how they deal with the mixed data if signals should be brought together.

The part I'm most worried about A is:

it's necessary to check two fields to find the method: http.request.original_method first and if missing, check http.request.method

I expect if we go with A, we will apply this solution in many different places. I'm not against having an additional field to store the original or canonical one if needed, but there must be a field where everything can be queried together.

HTTP SIG meeting,

@lmolkova I unfortunately couldn't join, so hopefully the above does not just repeat what was already discussed, otherwise please ignore.

@lmolkova lmolkova requested a review from a team May 15, 2023 20:32
@lmolkova
Copy link
Contributor Author

Discussed during Semconv SIG meeting:

  • we should preserve attribute semantics consistency across signals (eliminating option C)
  • we can make orinigal_method required on HTTP logs semconv (or perhaps security events semconv?) once they are added

based on this, I updated the PR to make original_method conditionally required for traces. Once we have a log convention that should reference it, we can move the attribute to a common yaml.

@ruflin do you think the following approach could work:

  • a specific semconv for HTTP access/security log would define a set of required attributes
  • it can require http.request.original_method to always be present

I understand it's still a breaking change for ECS users since the attribute is being renamed. It seems the alternatives listed above would not work in the long term once we take other signals, cardinality, and minimal duplication into consideration.

@ruflin
Copy link

ruflin commented May 16, 2023

I'm thinking of the sem-conv like an onion. In the center, we have the shared semconv across all signals. It is the most lenient version of it. Each layer of the onion can become more strict (but not more lenient). What this means, http.request.method would be specified as SHOULD use GET etc. and not MUST. Then traces or metrics could use MUST. I tried to create a quick (ugly) diagram for this:

image

For me this adheres to the base principle mentioned above: "we should preserve attribute semantics consistency across signals".

The main lines in the PR we are discussing are the following. The discussion is if it should be MUST or SHOULD.

If the HTTP request method is not known to the instrumentation, it MUST set the `http.request.method` attribute to `other` and SHOULD populate the exact method passed by client on `http.request.original_method` attribute.

HTTP method names are case-sensitive and `http.request.method` attribute value MUST match a standard (or documented elsewhere) HTTP method name exactly.

If there is agreement in the group to proceed with the proposal, I don't want to block it. One of the reason is, that from my perspective the schema can become more lenient without a breaking change but not the other way around. So my proposed change could still be made later on.

@jsuereth
Copy link
Contributor

jsuereth commented May 16, 2023

@ruflin I think there's a few points maybe not getting across from the semconv WG here.

There are two MAIN uses cases in OTEL that need to be addressed here, and are different than ECS.

  1. X-Signal joins. If signals have different meaning/values for the SAME attributes between logs/metrics/traces it makes cross signal queries and dashboarding much harder. This is one of the big "wins" for using OTEL over individual instrumentation is that we offer signals that can be joined and understood in unison.
  2. Instrumentation. When you go to produce metrics,traces,logs it can be MUCH easier to write a common set of signals to Context in OpenTelemetry and the pull those attributes as you generate each signal. For this, attributes MUST have the same value wherever they're used.

As such, in terms of design space, we CAN play with the following:

  • Some signals have more attributes than others. E .g. metrics require low cardinality, so we expect their attribute set to always be a subset of similar logs/events/spans.
  • We CAN specify having multiple attributes with nuanced meaning, e.g. http.method vs. http.raw_method

But the "layer of the onion" with strictness analogy doesn't really work here. I like the layer of the onion, but think about it in terms of common key-value that can be shared across signals, with each signal being ALLOWED to have more attributes if needed, and each "layer" of your o11y pipeline being able to add more attributes (e.g. otel collector enhancing telemetry with extra information).

Given the need for metric cardinality concerns, I think http.method needs to be the core in the onion with low-cardinality restrictions (what @lmolkova has in this PR). We can then add additional attributes to Logs (where higher cardinality is acceptable) without breaking metric<->log correlation.

@joaopgrassi
Copy link
Member

I read through the original PR discussions and will try to summarize what I understood

  • http.request.method can receive non well-known values
  • Which leads to cardinality problems on metrics BUT also to spans, since we recommend using that for the span name
    If I do aggregation on my span names, then this will also break.

I like the direction of option A here, except, and I quote/agree with @ruflin's comment on the original PR:

having Get under OTHER sounds also confusing to me.

I would be OK with XYZ becoming other but Get I'm not sure. Putting myself in a OTel user shoes, I would rather expect the library/backend to figure this out for me and "map" Get to GET or something standard.

Then I'd like to maybe bring again the suggestion from @Oberon00 as I think it has good value and it was something that also came to mind while I was reading all. With this, we essentially introduce a new attribute and keep recording http.request.method as-is today:

Instead of adding original_method with the unaltered method, I'd suggest adding method_kind with the normalized method and keeping method aligned with ECS and old OTel. Then we only need an adjustment to the span name wording to be generated like method_kind, and probably requirement levels for metrics. This would also align with the proposed http.status_code_class from

Option D

http.request.method represents the original "as-is". We introduce a new attribute http.request.method_canonical that represents the normalized/standard/canonical and basically has the value from https://www.rfc-editor.org/rfc/rfc9110#name-method-definitions

Example:

  • Invalid method:

    • http.request.method = XYZZ
    • http.request.method_canonical = INVALID
  • Non-normalized:

  • http.request.method = Get

  • http.request.method_canonical = GET

  • Valid method:

    • http.request.method = GET
    • http.request.method_canonical = GET

Pros

  • Metric semconv is experimental, so we can still change it to use http.request.method_canonical instead of
  • We address the cardinality problem also present in span names with http.request.method today
  • We don't diverge from ECS, if I got it right, no problem with ecs users and logs with this approach

Cons:

  • In the case of Valid method we have a duplication
  • We will have to change the recommendation of how we populate Span name. It might be "non-breaking" for users, if libraries were already reporting the "canonical" values but maybe not

For the name of the "canonical" attribute, chatGPT suggested

http.request.method.normalized
http.request.method.standard
http.request.method.canonical
http.request.method.normalized_value
http.request.method.normalized_method

@ruflin
Copy link

ruflin commented May 16, 2023

There are two MAIN uses cases in OTEL that need to be addressed here, and are different than ECS.

I think Otel and ECS are perfectly aligned here. ECS was built to exactly allow these cross signal queries. It is more a question on how we approach it. @jsuereth Please let me know if I miss here some fundamental differences.

For Elasticsearch the lenient approach works really well, because even in the data is mixed on ingest time (get, GET, GeT) I can still do strict queries. I understand, other storage engines might not support this that is why there is a preference to be strict in the core and instead deal with it by adding additional attributes. My concern with the attributes addition is that assuming we will do this for quite a few fields, the number of fields in the standard will explode quickly.

If the "addition" of attributes is the path forward, I like Option D proposed by @joaopgrassi

@lmolkova @jsuereth Sorry for the extra rounds on this one. But I think if we settle this it will become easy to apply the same logic to all the future changes by just referring to the discussion here.

@Oberon00
Copy link
Member

Oberon00 commented May 16, 2023

@joaopgrassi

I would be OK with XYZ becoming other but Get I'm not sure. Putting myself in a OTel user shoes, I would rather expect the library/backend to figure this out for me and "map" Get to GET or something standard.

I think somebody else has mentioned this already but it depends on the particular HTTP implementation if Get is treated like GET or like an unknown method. In the latter case, it would be extremely confusing if instrumentations mapped it to GET. However, if the instrumentation has the knowledge whether the instrumented server maps it itself, or ideally even has access to a pre-normalized method, then it could do some sort of "smart" mapping.

@epixa
Copy link

epixa commented May 16, 2023

I'm strongly in favor of Option D. This addresses the concerns I raised in the previous PR about unexpected behavior for security users, but most critically I think it's also the ideal user experience when consuming this data. I hate the idea of users having to make decisions about which fields they query based on the value of other fields, so I like that if a user wants to query the original value then they use the base field, and if they want to accept some context loss in order to benefit from low cardinality, they use that field. In both cases, they just use the one field that makes sense and it just works.

This does mean that we're duplicating data, but I think that should be acceptable if it's truly in the best interests of the overall user experience of consuming the data. Storage mechanisms like Elasticsearch could implement solutions to ensuring duplicate data like this does not incur twice the storage requirements overall - that many don't should be a factor to consider, but it shouldn't be the deciding factor IMO.

@lmolkova lmolkova changed the title Limit HTTP request method cardinality Limit HTTP request method cardinality: original_method (option A) May 16, 2023
@lmolkova
Copy link
Contributor Author

As a comparison, I have implemented option B (which seems to be option D from @joaopgrassi proposal, but canonical method stays case-sensitive) in #34. PTAL and provide feedback or approve if you're onboard.

@reyang
Copy link
Member

reyang commented May 16, 2023

Option E: not doing anything special - if we get a weird HTTP method "XYZ", just keep it.

And here are the reasons:

  1. Doing nothing special is simple.
  2. For logs/traces, keeping the original value would help the user to understand what's going on, if they see something like "OTHER" / "other" / "INVALID" - it'll be hard to dig deeper unless the information is stored somewhere else (e.g. another key-value pair).
  3. For metrics, cardinality problem would exist for any metrics and dimensions, it'll be better to have a generic solution instead of special solutions for each type of data or dimensions (e.g. SQL "INSERT" / "UPDATE" / "DELETE"). I don't think semantic conventions should mandate the cardinality (we can encourage best practices).

@lmolkova
Copy link
Contributor Author

lmolkova commented May 16, 2023

Option E: not doing anything special - if we get a weird HTTP method "XYZ", just keep it.

@reyang it's probably fine for clients, but makes it easy to create some form of ddos attack on metrics when server does not sanitize it. Anyone can send unauthorized requests with random methods. Presumably, there are just a few points for an invalid method and they're not really interesting to anyone (i.e. do not skew any other data), how bad would it be to have a lot of (but quite short) time series?

@reyang
Copy link
Member

reyang commented May 16, 2023

@reyang it's probably fine for clients, but makes it easy to create some form of ddos attack on metrics when server does not sanitize it. Anyone can send unauthorized requests with random methods. Presumably, there are just a few points for an invalid method and they're not really interesting to anyone (i.e. do not skew any other data), how bad would it be to have a lot of (but quite short) time series?

Like I explained in #17 (comment), metrics cardinality is a general problem - nothing specific to HTTP methods. I think we should tackle this issue in the metrics API/SDK space so it can lift all boats (e.g. #2960 is a great example / starting point).

@lmolkova
Copy link
Contributor Author

  • For logs/traces, keeping the original value would help the user to understand what's going on, if they see something like "OTHER" / "other" / "INVALID" - it'll be hard to dig deeper unless the information is stored somewhere else (e.g. another key-value pair).
  • For metrics, cardinality problem would exist for any metrics and dimensions, it'll be better to have a generic solution instead of special solutions for each type of data or dimensions (e.g. SQL "INSERT" / "UPDATE" / "DELETE"). I don't think semantic conventions should mandate the cardinality (we can encourage best practices).

@reyang essentially what you're saying is Option C, but generalized beyond this specific attribute with solution postponed till after HTTP semconv stability? correct?
The proposal might look like "Metric instrumentations SHOULD populate canonical method and populate unknown methods as 'other'" Correct?

Without it, cardinality limits would protect from DDOS attacks, but can make time series that got through less useful.

@reyang
Copy link
Member

reyang commented May 16, 2023

@reyang essentially what you're saying is Option C, but generalized beyond this specific attribute with solution postponed till after HTTP semconv stability? correct?

Yes. I think this is how semantic conventions should be positioned:

  1. HTTP method has a preferred list of values - e.g. "GET", "POST", "PUT", etc.
  2. The typical cardinality is around 10.
  3. The semconv should not say something like "values not listed here MUST be converted to ..." or "values not listed here MUST be dropped" or anything strong (with "MUST" wording).
  4. In the future we might say "instrumentations should provide hint/advice about the preferred list of values".

The API/SDK spec should be positioned to solve the cardinality/reliability problem:

  1. The API should provide a way (e.g. hint/advice) to indicate the cardinality of certain attributes, and what are the preferred values vs. noise.
  2. The SDK should provide mechanisms to control how to deal with noise and cardinality pressure (e.g. dim-capping).
  3. The SDK should provide a reasonable default behavior.

@lmolkova
Copy link
Contributor Author

@reyang essentially what you're saying is Option C, but generalized beyond this specific attribute with solution postponed till after HTTP semconv stability? correct?

Yes. I think this is how semantic conventions should be positioned:

  1. HTTP method has a preferred list of values - e.g. "GET", "POST", "PUT", etc.
  2. The typical cardinality is around 10.
  3. The semconv should not say something like "values not listed here MUST be converted to ..." or "values not listed here MUST be dropped" or anything strong (with "MUST" wording).
  4. In the future we might say "instrumentations should provide hint/advice about the preferred list of values".

The API/SDK spec should be positioned to solve the cardinality/reliability problem:

  1. The API should provide a way (e.g. hint/advice) to indicate the cardinality of certain attributes, and what are the preferred values vs. noise.
  2. The SDK should provide mechanisms to control how to deal with noise and cardinality pressure (e.g. dim-capping).
  3. THe SDK should provide a reasonable default behavior.

There was a general push-back to have different semantics for different signals. Dim-capping on metrics can be considered as such inconsistency. cc @jsuereth @arminru

Still, I drafted it here: open-telemetry/opentelemetry-specification#35 with the caveat that I don't believe instrumentation should provide hints on HTTP methods since we have a standard and common 10 methods that are used in the absolute majority of requests.

the drawback of this change is that span name is not quite as specific as it was before, but I don't see a huge harm even if the span name is of high cardinality for invalid requests. group-by, top-N queries should still work great

@jsuereth jsuereth mentioned this pull request May 22, 2023
Copy link
Member

@joaopgrassi joaopgrassi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need a changelog for this now :)

@trask
Copy link
Member

trask commented Jun 19, 2023

@Oberon00 do you have any remaining concerns that you'd like to see addressed before merging?

semantic_conventions/http-common.yaml Outdated Show resolved Hide resolved
semantic_conventions/http-common.yaml Outdated Show resolved Hide resolved
@Oberon00
Copy link
Member

I guess we are aware that this will basically pre-determine the general "_original" naming convention and the "_OTHER" value that open-telemetry/opentelemetry-specification#117 asks for, as soon as HTTP is marked stable + a release is cut. I think that's fine from a priority perspective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.